## 10. Summary
![REINFORCE](img/screen-shot-2018-07-17-at-4.44.10-pm.png)

REINFORCE increases the probability of "good" actions and decreases the probability of "bad" actions. (Source)
### What are Policy Gradient Methods?
- Policy-based methods are a class of algorithms that search directly for the optimal policy, without simultaneously maintaining value function estimates.
- Policy gradient methods are a subclass of policy-based methods that estimate the weights of an optimal policy through gradient ascent.
- In this lesson, we represent the policy with a neural network, where our goal is to find the weights \theta of the network that maximize expected return.
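As a concrete reference, here is a minimal sketch of such a policy network, assuming a PyTorch implementation and a small discrete-action environment; the layer sizes, `state_size=4`, and `action_size=2` are illustrative placeholders, not the lesson's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class Policy(nn.Module):
    """Maps a state to a probability distribution over (discrete) actions."""
    def __init__(self, state_size=4, hidden_size=16, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        return F.softmax(self.fc2(x), dim=-1)   # action probabilities pi_theta(.|s)

    def act(self, state):
        """Sample an action and return it with its log-probability."""
        state = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        probs = self.forward(state)
        dist = Categorical(probs)
        action = dist.sample()
        return action.item(), dist.log_prob(action)
```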
### The Big Picture
- The policy gradient method will iteratively amend the policy network weights to:
  - make (state, action) pairs that resulted in positive return more likely, and
  - make (state, action) pairs that resulted in negative return less likely.
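The sign of the return is what drives the update. A toy, self-contained sketch of this idea (assuming PyTorch; the numbers are arbitrary, and this is not the full algorithm):

```python
import torch

# A single log-probability weighted by its trajectory's return.
log_prob = torch.tensor(-0.7, requires_grad=True)  # stands in for log pi(a|s)
R = 1.0                                            # return of the trajectory (positive in this toy case)

surrogate = -(log_prob * R)   # minimizing this maximizes log pi(a|s) when R > 0
surrogate.backward()
print(log_prob.grad)          # tensor(-1.), i.e. -R: a descent step pushes log_prob up,
                              # making the (state, action) pair more likely; R < 0 flips the sign
```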
### Problem Setup
- A trajectory \tau is a state-action sequence s_0, a_0, \ldots, s_H, a_H, s_{H+1}.
- In this lesson, we will use the notation R(\tau) to refer to the return corresponding to trajectory \tau.
- Our goal is to find the weights \theta of the policy network that maximize the expected return U(\theta) := \sum_\tau \mathbb{P}(\tau;\theta)R(\tau).
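A small helper sketch for R(\tau), assuming Python; `gamma=1.0` recovers the plain cumulative reward used in this lesson's notation, while a discount factor is a common variant:

```python
def trajectory_return(rewards, gamma=1.0):
    """R(tau): the (optionally discounted) cumulative reward of one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. the rewards r_0, ..., r_H collected along one trajectory
print(trajectory_return([1.0, 1.0, 1.0, 1.0]))   # 4.0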
### REINFORCE
- The pseudocode for REINFORCE is as follows:
  1. Use the policy \pi_\theta to collect m trajectories \{ \tau^{(1)}, \tau^{(2)}, \ldots, \tau^{(m)} \} with horizon H. We refer to the i-th trajectory as \tau^{(i)} = (s_0^{(i)}, a_0^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}, s_{H+1}^{(i)}).
  2. Use the trajectories to estimate the gradient \nabla_\theta U(\theta): \nabla_\theta U(\theta) \approx \hat{g} := \frac{1}{m}\sum_{i=1}^m \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) R(\tau^{(i)})
  3. Update the weights of the policy: \theta \leftarrow \theta + \alpha \hat{g}
  4. Loop over steps 1-3.
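A minimal training-loop sketch of the pseudocode above. It assumes PyTorch, the classic Gym API (env.reset() returning a state array, env.step() returning four values; newer gymnasium versions differ), and the Policy class from the earlier sketch; m, H, the learning rate, and the iteration count are illustrative choices rather than the lesson's settings:

```python
import gym
import torch
import torch.optim as optim

env = gym.make('CartPole-v1')                      # any discrete-action env with state_size=4 works here
policy = Policy(state_size=4, action_size=2)
optimizer = optim.Adam(policy.parameters(), lr=1e-2)

m = 10      # number of trajectories per gradient estimate
H = 1000    # horizon (maximum steps per trajectory)

for iteration in range(100):
    sum_log_probs = []   # sum_t log pi_theta(a_t|s_t) for each trajectory
    returns = []         # R(tau^(i)) for each trajectory

    # Step 1: use the policy to collect m trajectories with horizon H.
    for _ in range(m):
        state = env.reset()
        log_probs, rewards = [], []
        for t in range(H):
            action, log_prob = policy.act(state)
            state, reward, done, _ = env.step(action)
            log_probs.append(log_prob)
            rewards.append(reward)
            if done:
                break
        sum_log_probs.append(torch.cat(log_probs).sum())
        returns.append(sum(rewards))               # undiscounted R(tau^(i))

    # Step 2: estimate the gradient. Minimizing this surrogate loss with the
    # optimizer is equivalent to a gradient-ascent step along g-hat.
    loss = -torch.stack([lp * R for lp, R in zip(sum_log_probs, returns)]).mean()

    # Step 3: update the weights of the policy (theta <- theta + alpha * g-hat).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```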
### Derivation
- We derived the likelihood ratio policy gradient: \nabla_\theta U(\theta) = \sum_\tau \mathbb{P}(\tau;\theta)\nabla_\theta \log \mathbb{P}(\tau;\theta)R(\tau).
- We can approximate the gradient above with a sample-weighted average: \nabla_\theta U(\theta) \approx \frac{1}{m}\sum_{i=1}^m \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta)R(\tau^{(i)}).
- We calculated the following: \nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta) = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta (a_t^{(i)}|s_t^{(i)}).
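For reference, the intermediate algebra between these bullets is the standard likelihood-ratio (log-derivative) trick; the worked version below omits the initial-state distribution term, which also carries no \theta-dependence:

\begin{aligned}
\nabla_\theta U(\theta) &= \nabla_\theta \sum_\tau \mathbb{P}(\tau;\theta)R(\tau) = \sum_\tau \nabla_\theta \mathbb{P}(\tau;\theta)\,R(\tau) \\
&= \sum_\tau \mathbb{P}(\tau;\theta)\,\frac{\nabla_\theta \mathbb{P}(\tau;\theta)}{\mathbb{P}(\tau;\theta)}\,R(\tau) = \sum_\tau \mathbb{P}(\tau;\theta)\,\nabla_\theta \log \mathbb{P}(\tau;\theta)\,R(\tau)
\end{aligned}

For the last bullet, the trajectory probability factors into transition probabilities and policy terms, and only the policy terms depend on \theta:

\nabla_\theta \log \mathbb{P}(\tau^{(i)};\theta) = \nabla_\theta \sum_{t=0}^{H}\Big[\log \mathbb{P}(s_{t+1}^{(i)}|s_t^{(i)},a_t^{(i)}) + \log \pi_\theta(a_t^{(i)}|s_t^{(i)})\Big] = \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}).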
### What's Next?
- REINFORCE can solve Markov Decision Processes (MDPs) with either discrete or continuous action spaces.
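For the continuous-action case, only the output distribution changes; the log-probability term in \hat{g} is computed the same way. A sketch assuming PyTorch and a Gaussian parameterization (the sizes and the learned log-std are illustrative, not the lesson's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    """Maps a state to a Gaussian distribution over continuous actions."""
    def __init__(self, state_size=3, hidden_size=16, action_size=1):
        super().__init__()
        self.fc = nn.Linear(state_size, hidden_size)
        self.mu = nn.Linear(hidden_size, action_size)          # mean of the action distribution
        self.log_std = nn.Parameter(torch.zeros(action_size))  # learned log standard deviation

    def act(self, state):
        x = F.relu(self.fc(torch.as_tensor(state, dtype=torch.float32)))
        dist = Normal(self.mu(x), self.log_std.exp())
        action = dist.sample()
        # Summing log-probabilities over action dimensions gives log pi_theta(a|s),
        # so the REINFORCE update above is unchanged.
        return action.numpy(), dist.log_prob(action).sum()
```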